The majority of the content in this resource is from the book Recommender Systems: The Textbook (Aggarwal 2016).

Although I use

\[r_{ij}=\text{rating of item } j \text{ by user } i\]

as the outcome of interest throughout this document, this can be more generally understood as

\[\text{user } i \text{'s affinity for item } j\]

The exact outcome of interest will depend on the recommendation context (e.g. the outcome of interest might be ‘probability of click’ in a web recommendation context).










Goals of Recommender Systems

[back to contents]

  1. Relevance: The primary goal of standard recommender systems is to highlight for each user the subset of items which are most relevant to them, meaning those items which they would be most interested in (the items with the most utility to them).

However, there are several secondary goals that are also important in many cases:

  1. Novelty: recommended items should be ones that a user has not seen before (or could not easily find on their own).

  2. Serendipity: item recommendations should sometimes be unexpected (pleasantly surprising) to the user. Serendipity is “finding valuable or pleasant things that are not looked for.” (Kaminskas and Bridge 2016)


  3. Diversity: “In information retrieval... [covering] a broad area of the information space increases the chance of satisfying the user’s information need” (Kaminskas and Bridge 2016). This is because the user’s intent is somewhat ambiguous (e.g. whether “bat” refers to an animal or to sporting equipment), and returning diverse results corresponding to the query makes it more likely that the user’s desired item is within the result set. This applies equally to recommender systems: since there is some ambiguity in understanding what the user wants, it is safer not to put all of our eggs into one basket.

  4. Coverage: “Coverage reflects the degree to which the generated recommendations cover the catalogue of available items” (Kaminskas and Bridge 2016). This is important both for users (it improves the usefulness of the system) and for business-owners (because showing users the same small subset of the item catalogue might impact stock level management of physical items, and also because there has been a societal shift in consumer demand for products in the long tail of the catalogue (Armstrong 2008)).

There is a good discussion of this topic (the multiple goals of recommender systems) and of how these outcomes can be optimized and objectively measured in the paper Diversity, Serendipity, Novelty, and Coverage: A Survey and Empirical Analysis of Beyond-Accuracy Objectives in Recommender Systems (Kaminskas and Bridge 2016).

Some further notes:










Design Patterns

[back to contents]

TODO: need to complete this list!










Supervised Learning

[back to contents]

The problem of predicting a particular user’s affinity for a particular item can be framed as a supervised learning problem, allowing us to use any one of the available plethora of performant classification, regression or ranking models (generalized linear models, gradient boosting, random forest, deep learning etc.).

The user rating (affinity) for a particular item is modelled as a function of the user’s attributes, the item’s attributes and (possibly) information on the recommendation context:

\[\begin{array}{lcl} r_{ij} &=& f\Big(\mathbf{x}^{(user)}_i, \mathbf{x}^{(item)}_j, \mathbf{x}^{(context)}_k\Big) \\ \mathbf{x}^{(user)}_i &=& \text{vector of user attributes} \\ \mathbf{x}^{(item)}_j &=& \text{vector of item attributes} \\ \mathbf{x}^{(context)}_k &=& \text{vector of recommendation context information} \\ \end{array}\]

A strength of this method is that it can mitigate the recommendation cold start problem (in which recommendations cannot be generated for new users and/or items) by sharing information across users and items.

Some specific examples of recommender architectures based on supervised learning are:

  1. Factorization Machines

  2. The 2 Tower model

  3. The Wide & Deep model

  4. The Deep & Cross model
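As a minimal illustration of this framing (with hypothetical, made-up feature names), each observed (user, item) pair is joined into one feature row, and any regressor is fitted to the observed ratings; here a plain linear model fitted by least squares stands in for the more powerful models named above:

```python
import numpy as np

# Hypothetical joined feature rows for observed (user, item, context) triples:
# [user_age, item_price, item_popularity, hour_of_day]
X = np.array([
    [25.0, 9.99, 0.8, 20.0],
    [34.0, 4.50, 0.2, 9.0],
    [25.0, 4.50, 0.2, 21.0],
    [51.0, 9.99, 0.8, 10.0],
])
y = np.array([5.0, 2.0, 3.0, 4.0])  # observed ratings r_ij

# Fit r_ij ≈ w · [x, 1] by least squares (any of the regression models
# named above could be swapped in here)
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Score a previously unseen (user, item, context) combination: information
# is shared through the attributes, which mitigates cold start
r_hat = np.array([30.0, 9.99, 0.8, 19.0, 1.0]) @ w
```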










Content-Based Recommendation

[back to contents]

| user_ID | movie | user_rating | length_hours | origin_country | budget_millions | genre | description | thrilling | exploding | adventure | exotic | drama | love |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 46290 | movie A | 5 | 1.5 | USA | 50 | action | a thrilling exploding adventure | 1 | 1 | 1 | 0 | 0 | 0 |
| 46290 | movie B | 3 | 2.0 | India | 60 | drama | an exotic adventure full of drama | 0 | 0 | 1 | 1 | 1 | 0 |
| 46290 | movie C | 4 | 1.5 | UK | 100 | action | things exploding everywhere | 0 | 1 | 0 | 0 | 0 | 0 |
| 46290 | movie D | 1 | 3.0 | USA | 4 | romance | an american love story drama | 0 | 0 | 0 | 0 | 1 | 1 |

The premise behind content-based recommendation is that once a user has consumed multiple items, we can use the attributes of those items to build an item attribute preference profile for that user, and use this profile to recommend future items to them.

For example, in a food recommendation context we might learn that a certain user likes Italian and Indian food, but never orders red meat or chilli.

A content-based strategy is particularly effective in situations in which:

  1. There is rich and predictive item attribute data available (structured or raw text)

  2. Past user item consumption is predictive of future preferences

Compared to collaborative filtering:

  1. content-based recommendation can robustly recommend items with few or no ratings (i.e. it alleviates the cold start problem for new items)

  2. content-based recommendation tends to produce more relevant, if also more boring and obvious, recommendations. This is because it cannot recommend items outside of the user’s historic item attribute consumption (i.e. it won’t recommend items outside of the scope of what they have consumed in the past)

If users are able to explicitly specify their own preferences (like a knowledge-based recommendation system), then this information can be incorporated into their item attribute preference profile.
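As a small sketch of this idea (using the movie attribute columns from the table above, and hypothetical candidate items): the user’s profile is a rating-weighted average of the attribute vectors of the items they have consumed, and candidates are scored by cosine similarity to that profile:

```python
import numpy as np

# Binary attribute vectors for the movies in the table above
# (columns: thrilling, exploding, adventure, exotic, drama, love)
items = {
    "movie A": np.array([1.0, 1, 1, 0, 0, 0]),
    "movie B": np.array([0.0, 0, 1, 1, 1, 0]),
    "movie C": np.array([0.0, 1, 0, 0, 0, 0]),
    "movie D": np.array([0.0, 0, 0, 0, 1, 1]),
}
ratings = {"movie A": 5, "movie B": 3, "movie C": 4, "movie D": 1}

# Item attribute preference profile: rating-weighted mean of the
# attribute vectors of the items this user has consumed
profile = sum(r * items[m] for m, r in ratings.items()) / sum(ratings.values())

def score(attrs):
    """Cosine similarity between a candidate item and the user profile."""
    return attrs @ profile / (np.linalg.norm(attrs) * np.linalg.norm(profile))

# This user rated action highly and romance poorly, so a hypothetical unseen
# action movie scores higher than a hypothetical unseen romance
action = score(np.array([1.0, 1, 0, 0, 0, 0]))
romance = score(np.array([0.0, 0, 0, 0, 1, 1]))
```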

Content-Based Recommendation: Nearest-Neighbour Classification

TODO

Content-Based Recommendation: Raw Text Preprocessing

TODO










Creating User & Item Characterization Vectors

[back to contents]

TODO










Collaborative Filtering

[back to contents]

image source: author

Collaborative Filtering refers, in general, to the process of inferring the unobserved preferences of entities (e.g. users) using the observed preferences of other entities (e.g. other users). It is “collaborative” in the sense that entities (unwittingly) contribute information toward each other’s recommendations.

Collaborative Filtering: Neighbourhood: User-User Similarity

image source: author

  • Predict an unobserved rating for item \(j\) by user \(i\) as the average rating of item \(j\) across the \(k\) users most similar to user \(i\) who have rated item \(j\). This average can (if desired) be weighted proportionately to each user’s similarity to user \(i\).

  • Similarity between users is traditionally defined using the distance between their item rating vectors (i.e. their rows in the user/item matrix). Vector distance metrics such as cosine similarity, Euclidean distance, Manhattan distance etc. can be used for this.

  • ‘Closest \(k\) neighbours’ can (if desired) be replaced by ‘all neighbours within a distance of d’.

  • Compared to Item-Item Similarity, User-User Similarity tends to produce more serendipitous - if sometimes less relevant - recommendations (see also Combining User-User & Item-Item Similarity).
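The steps above can be sketched as follows (with a tiny hypothetical ratings matrix; cosine similarity is computed over co-rated items only):

```python
import numpy as np

# Hypothetical user/item ratings matrix (np.nan = unobserved)
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 2.0,    1.0],
    [5.0, 5.0, 1.0,    2.0],
    [1.0, 2.0, 5.0,    4.0],
])

def cosine(a, b):
    mask = ~np.isnan(a) & ~np.isnan(b)   # use co-rated items only
    a, b = a[mask], b[mask]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(R, i, j, k=2):
    """Similarity-weighted average of item j's rating over the k users
    most similar to user i who have rated item j."""
    sims = [(cosine(R[i], R[u]), u) for u in range(len(R))
            if u != i and not np.isnan(R[u, j])]
    top = sorted(sims, reverse=True)[:k]
    return sum(s * R[u, j] for s, u in top) / sum(s for s, _ in top)

# User 0's missing rating of item 2 is imputed from their 2 nearest
# neighbours, both of whom disliked item 2
r_hat = predict(R, i=0, j=2)
```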

Collaborative Filtering: Neighbourhood: Item-Item Similarity

image source: author

  • Item-Item collaborative filtering is mathematically identical to User-User, except that the calculation is performed on columns of the user/item rating matrix rather than rows

    i.e. an unobserved rating for item \(j\) by user \(i\) is estimated as the average rating by user \(i\) over the \(k\) most similar items to item \(j\) that user \(i\) has rated.

  • Compared to User-User Similarity, Item-Item Similarity tends to produce more relevant - if sometimes more boring/obvious - recommendations (see also Combining User-User & Item-Item Similarity).

Collaborative Filtering: Neighbourhood: Combining User-User and Item-Item Similarity

Since any missing (unobserved) entry \(r_{ij}\) in the user/item ratings matrix can be estimated using either User-User Similarity (\(\hat{r}_{ij}^{(user)}\)) or Item-Item Similarity (\(\hat{r}_{ij}^{(item)}\)), a given missing entry can also be estimated as a weighted average of the two:

\[\hat{r}_{ij} \quad=\quad \alpha \space \hat{r}_{ij}^{(user)} + (1-\alpha) \space \hat{r}_{ij}^{(item)} \quad,\quad\alpha\in[0,1]\]

..where the hyperparameter \(\alpha\) can be used to control the balance between recommendation relevance and recommendation serendipity.

Collaborative Filtering: Matrix Factorization

Matrix factorization refers to the process of learning a low-rank approximation of the user/item ratings matrix, then using this low-rank representation to infer the missing (unobserved) ratings in the matrix.

image source: author

For more information, refer to Matrix Factorization (Latent Factor Models)

Collaborative Filtering: Neighbourhood: Graph-Based

<TODO: very short description here>

refer to Graph-Based Collaborative Filtering










Multi-Armed Bandits

[back to contents]

TODO










Vincent’s Lemma: Serendipity

[back to contents]

This is a simple and heuristic method (created by Vincent Warmerdam) that was found to substantially increase the coverage of item recommendations in a commercial movie recommendation context.

It is a system for next item recommendation.

It shares some conceptual similarities with Association Rules-Based Recommendation

\[\begin{array}{lcl} R_{j\rightarrow{}i} &=& \text{relative affinity for item } i \text{ among item } j \text{ consumers compared to the rest of the user population} \\ &=& \displaystyle\frac{\text{% of item } j \text{ consumers who have also consumed item } i}{\text{% of non-item } j \text{ consumers who have also consumed item } i} \\ &=& \displaystyle\frac{p(s_i|s_j)}{p(s_i|s_j^c)} \\ &\approx& \displaystyle\frac{p(s_i|s_j)}{p(s_i)} \\ \end{array}\]

For users who have consumed item \(j\), the item \(i\) with the highest score \(R_{j\rightarrow{}i}\) should be recommended to them.

For items with a low amount of user interaction, the \(R_{j\rightarrow{}i}\) score will be unstable. This could be alleviated by making the calculation Bayesian, and shrinking the metric using a chosen prior distribution.
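A minimal sketch of the score on hypothetical unary consumption data (using the exact non-consumer denominator rather than the \(p(s_i)\) approximation; no Bayesian shrinkage applied):

```python
# Hypothetical unary consumption data: user -> set of items consumed
consumed = {
    "u1": {"A", "B"}, "u2": {"A", "B"}, "u3": {"A"},
    "u4": {"B", "C"}, "u5": {"C"},      "u6": {"C"},
}

def relative_affinity(j, i):
    """R_{j->i} = P(consumed i | consumed j) / P(consumed i | not consumed j)."""
    have_j = [s for s in consumed.values() if j in s]
    not_j = [s for s in consumed.values() if j not in s]
    p_i_given_j = sum(i in s for s in have_j) / len(have_j)
    p_i_given_not_j = sum(i in s for s in not_j) / len(not_j)
    return p_i_given_j / p_i_given_not_j

# Among "A" consumers, 2 of 3 also consumed "B"; among non-"A" consumers,
# only 1 of 3 did, so "B" is a strong next-item candidate after "A"
r_ab = relative_affinity("A", "B")
```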










Association Rules-Based Recommendation

[back to contents]

Association Rules are rules of the form

\[\overset{\text{antecedent set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{consequent set}}{\{...\}}\]

Very efficient algorithms such as the Apriori algorithm have been devised for searching data for rules of this form.

Association Rules are particularly effective in the case of unary ratings.

Here are some of the ways that Association Rules can be used to generate item recommendations:

  1. Item-wise Recommendations:

    • \(\overset{\text{item set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{item set}}{\{...\}}\)

    • Example: \(\{\text{bread, tomato}\}=>\{\text{cheese}\}\quad\) i.e. users who have bought bread and tomato should be recommended cheese.

  2. User-wise Recommendations

    • \(\overset{\text{user set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{user set}}{\{...\}}\)

    • Example: \(\{\text{alice, bob}\}=>\{\text{john}\}\quad\) i.e. if users “Alice” and “Bob” have bought an item then “John” is also likely to like it

  3. Profile Association Rules:

    • \(\overset{\text{user attribute set}}{\{...\}} \quad\overset{\text{implies}}{=>} \quad \overset{\text{item set}}{\{...\}}\)

    • Example: \(\{\text{male, age30-39, 2_children}\}=>\{\text{home loan}\}\quad\) i.e. a large proportion of male users in their 30s with 2 children have consumed the item “home loan”, making it a promising recommendation for this user segment
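The two quantities underlying any such rule are its support and confidence; the brute-force scan below illustrates them on hypothetical baskets (in practice the Apriori algorithm finds high-support rules far more efficiently than this):

```python
# Hypothetical transaction data (unary ratings): one basket per user
baskets = [
    {"bread", "tomato", "cheese"},
    {"bread", "tomato", "cheese"},
    {"bread", "tomato"},
    {"bread", "milk"},
    {"milk", "cheese"},
]

def rule_stats(antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent."""
    n_antecedent = sum(antecedent <= b for b in baskets)          # baskets containing antecedent
    n_both = sum((antecedent | consequent) <= b for b in baskets)  # baskets containing both sets
    return n_both / len(baskets), n_both / n_antecedent

# {bread, tomato} => {cheese}: appears together in 2 of 5 baskets,
# and 2 of the 3 {bread, tomato} baskets also contain cheese
support, confidence = rule_stats({"bread", "tomato"}, {"cheese"})
```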










Sequential Pattern Mining

[back to contents]

TODO










Clustering-Based Recommendation

[back to contents]

TODO










Graph-Based Collaborative Filtering

[back to contents]

image source: author

image source: author

Here is a lovely resource on neural networks for graph modelling.










Matrix Factorization (Latent Factor Models)

[back to contents]

Matrix factorization refers to the process of learning a low-rank approximation of the user/item ratings matrix then using this low-rank representation to infer the missing (unobserved) ratings in the matrix.

image source: author

There are many different ways in which this matrix factorization can be performed, each of which has various different strengths and weaknesses. These variants are defined by the constraints which they impose on the latent factors. Imposing constraints on the latent factors will always decrease accuracy (increase error) on the observed (training) data, but these constraints can also improve model generalization (error on unobserved data) and increase model interpretability.
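As a concrete sketch of the unconstrained variant, the latent factor matrices can be learned by stochastic gradient descent over the observed entries only, with an L2 penalty as the regularizer (hypothetical tiny ratings matrix; hyperparameters chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical user/item ratings matrix (np.nan = unobserved)
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 2.0, 1.0],
    [1.0, 2.0, 5.0, np.nan],
])
n_users, n_items, k = R.shape[0], R.shape[1], 2
U = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
V = rng.normal(scale=0.1, size=(n_items, k))   # item latent factors
observed = [(i, j) for i in range(n_users) for j in range(n_items)
            if not np.isnan(R[i, j])]

lr, lam = 0.02, 0.01                           # learning rate, L2 penalty
for _ in range(2000):                          # SGD over observed cells only
    for i, j in observed:
        err = R[i, j] - U[i] @ V[j]            # error on this observed rating
        U[i] += lr * (err * V[j] - lam * U[i])
        V[j] += lr * (err * U[i] - lam * V[j])

R_hat = U @ V.T   # low-rank reconstruction; fills in the nan entries
```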

Here is a summary of factorization methods from (Aggarwal 2016) -

The Family of Matrix Factorization Methods

| Method | Constraints on factor matrices | Advantages/disadvantages |
|---|---|---|
| Unconstrained | none | Highest quality solution; good for most matrices; regularisation prevents overfitting; poor interpretability |
| Singular Value Decomposition (SVD) | orthogonal basis | Good visual interpretability; out-of-sample recommendations; good for dense matrices; poor semantic interpretability; suboptimal in sparse matrices |
| Maximum Margin | none | Highest quality solution; resists overfitting; similar to unconstrained; poor interpretability; good for discrete ratings |
| Non-Negative Matrix Factorization (NMF) | non-negativity | Good quality solution; high semantic interpretability; loses interpretability with both like/dislike ratings; less overfitting in some cases; best for implicit feedback |
| Probabilistic Latent Semantic Analysis (PLSA) | non-negativity | Good quality solution; high semantic interpretability; probabilistic interpretation; loses interpretability with both like/dislike ratings; less overfitting in some cases; best for implicit feedback |

Matrix factorization models can also be combined with other recommender models within a single model architecture. Refer to:

  1. Integrating Latent Factor Models with Arbitrary Models

  2. Hybrid Systems










Naïve Bayes Collaborative Filtering

[back to contents]

TODO










Knowledge-Based Recommendation

[back to contents]

Knowledge-based recommender systems generate recommendations by combining an explicit user query with domain knowledge/algorithmic intelligence (e.g. user attributes, item attributes, historic item consumption data, recommendation context, item availability etc.). In this way, they lie somewhere on the spectrum between a pure queryless recommender system and a search engine.

Knowledge-based recommendation is typically facilitated through an iterative/cyclic exchange between the user and an interactive user interface, for example:

\[\text{user query} >> \text{query result} >> \text{refined user query} >> \text{refined query result} >> \text{refined user query} >> \text{refined query result} >> ...\text{etc. (until user finds a satisfactory item)}\]

Knowledge-based recommender systems are well-suited to recommendation contexts in which each item is unique, and not sold often (e.g. houses and cars).

Knowledge-Based Recommendation: Constraint-Based

User provides a set of constraints, and item recommendations are then provided from the subset of items matching the user constraints (the items within the subset can be further ranked by a relevance model). Example constraints for a house search:

  • “homes in East London”

  • “price < $100”

  • “bathrooms > 2”

After receiving the query result, the user can refine their constraint set (make the search more liberal or more conservative) and rerun the query.
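One simple way to implement this (an invented catalogue and invented constraint predicates, purely for illustration) is to treat each constraint as a predicate and keep the items that satisfy all of them; refining the query is then just editing the predicate list:

```python
# Hypothetical house catalogue (items in a knowledge-based system)
houses = [
    {"id": 1, "area": "East London", "price": 450_000, "bathrooms": 3},
    {"id": 2, "area": "East London", "price": 900_000, "bathrooms": 1},
    {"id": 3, "area": "North London", "price": 400_000, "bathrooms": 3},
]

# Each user constraint becomes a predicate; an item must satisfy all of
# them. Loosening/tightening this list is the iterative refinement step.
constraints = [
    lambda h: h["area"] == "East London",
    lambda h: h["price"] < 500_000,
    lambda h: h["bathrooms"] > 2,
]

matches = [h for h in houses if all(c(h) for c in constraints)]
# `matches` would then be ranked by a relevance model before display
```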

Knowledge-Based Recommendation: Case-Based

The user provides a case/target/anchor item, and the recommender system then finds other items similar to it, possibly along user-specified item dimensions. Example: “please find me songs similar to ‘Come Together’ by the Beatles”.










Hybrid Systems

[back to contents]

TODO










Tradeoffs Between Different Recommendation Algorithms

[back to contents]

TODO










Factorization Machines

[back to contents]

Factorization Machines (Rendle (2010)) are simply linear regression models with interactions, but in which the model coefficients on the interaction terms are modelled using a latent factor model. This factorization helps to avoid overfitting and improve generalization, especially when modelling sparse data. This is achieved by breaking the independence between the interaction coefficients (i.e. allowing/forcing information sharing between them).

image source: Rendle (2010)

The basic model is defined:

\[\begin{array}{lcl} \hat{y}(\overrightarrow{\mathbf{x}}) &:=& \underset{\text{standard linear regression}}{\underbrace{w_0 + \displaystyle{\sum_{i=1}^p}w_ix_i}} + \underset{\text{latent factor model of all 2-way interactions}}{\underbrace{\displaystyle{\sum_{i=1}^p\sum_{j=i+1}^p<\overrightarrow{\mathbf{v}}_i,\overrightarrow{\mathbf{v}}_j>}x_ix_j}} \\ \space &\space& w_0\in \mathbb{R}, \quad\overrightarrow{\mathbf{w}} \in \mathbb{R}^{p}, \quad \mathbf{V} \in \mathbb{R}^{p \times k} \\ \end{array}\]
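The pairwise term looks quadratic in \(p\), but Rendle (2010) shows it can be computed in \(O(kp)\) via the identity \(\sum_{i<j}\langle v_i,v_j\rangle x_i x_j = \frac{1}{2}\sum_f\big[(\sum_i v_{if}x_i)^2 - \sum_i v_{if}^2x_i^2\big]\). A sketch of the prediction (random stand-in parameters; in practice \(w_0\), \(\overrightarrow{\mathbf{w}}\) and \(\mathbf{V}\) are learned), checked against the naive double sum:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 6, 3                              # num features, latent dimension
w0 = 0.1                                 # global bias
w = rng.normal(size=p)                   # linear weights
V = rng.normal(scale=0.1, size=(p, k))   # one latent vector per feature

def fm_predict(x):
    """FM prediction using Rendle's O(kp) rewrite of the pairwise term."""
    s = V.T @ x                          # (k,): sum_i V[i,f] * x[i]
    s2 = (V ** 2).T @ (x ** 2)           # (k,): sum_i V[i,f]^2 * x[i]^2
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)

x = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])  # sparse one-hot-style input

# Sanity check against the naive O(p^2) double sum over interactions
naive = w0 + w @ x + sum((V[i] @ V[j]) * x[i] * x[j]
                         for i in range(p) for j in range(i + 1, p))
```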










Incorporating Context

[back to contents]

TODO










Session-Based Recommendation

[back to contents]

TODO










Wide & Deep Model

[back to contents]

Inspired by the observation that shallow models (like linear models) tend to be better at memorization (finding specific reoccurring patterns in data) while deep models tend to be better at generalization (approximating the structure of the underlying data-generating system), the Wide and Deep (Cheng et al. 2016) model is an architecture containing both a wide component and a deep component - an attempt to combine the specific strengths of these 2 different models in a unified framework.

Google evaluated this model for app recommendation on their app store Google Play.

image source: author

More specifically:










Deep & Cross Model

[back to contents]

The Deep & Cross (Wang et al. 2021) model … TODO










2-Tower Model

[back to contents]

TODO

image source: author

[back to contents]










Integrating Latent Factor Models with Arbitrary Models

[back to contents]

\[\begin{array}{lcl} \hat{r}_{ij} &=& \underset{\text{main effects:}\\\text{user generosity &}\\\text{item popularity}}{\underbrace{\quad o_i + p_j \quad}} + \underset{\text{matrix factorization/}\\\text{latent factor model}}{\underbrace{\quad \overrightarrow{u}_i \cdot \overrightarrow{v}_j \quad}} + \beta \space F\bigg(\underset{\text{content model}}{\underbrace{\overrightarrow{c}_i^{(user)}, \overrightarrow{c}_j^{(item)}}}, \underset{\text{collaborative model}}{\underbrace{\overrightarrow{r}_i^{(user)}, \overrightarrow{r}_j^{(item)}}}\bigg)\\ \space &\space& \space \\ \hat{r}_{ij} &=& \text{predicted rating of item } j \text{ by user } i\\ o_i &=& \text{user } i \text{ bias (e.g. user who rates all items highly)} \\ p_j &=& \text{item } j \text{ bias (e.g. item that all users rate highly)} \\ \overrightarrow{u}_i &=& \text{latent factor representation of user } i\\ \overrightarrow{v}_j &=& \text{latent factor representation of item } j \\ \beta &=& \text{adjustment/weight of non-latent portion of model} \\ F() &=& \text{any chosen function (e.g. linear regression)} \\ \overrightarrow{c}_i^{(user)} &=& \text{user } i \text{ attribute/content vector} \\ \overrightarrow{c}_j^{(item)} &=& \text{item } j \text{ attribute/content vector} \\ \overrightarrow{r}_i^{(user)} &=& \text{user } i \text{ observed ratings vector (row of user/item ratings matrix)} \\ \overrightarrow{r}_j^{(item)} &=& \text{item } j \text{ observed ratings vector (column of user/item ratings matrix)} \\ \end{array}\]

The parameters of this entire model (including the latent factors \(\overrightarrow{u}_i\) and \(\overrightarrow{v}_j\)) can be optimized simultaneously.
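A sketch of the combined prediction (randomly initialized stand-in parameters; in practice all of them are fitted jointly, and for brevity the stand-in \(F()\) here is a linear function of the content vectors only, omitting the ratings-vector arguments):

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k = 4, 5, 2

# Hypothetical fitted parameters
o = rng.normal(size=n_users)            # user biases (generosity)
p = rng.normal(size=n_items)            # item biases (popularity)
U = rng.normal(size=(n_users, k))       # user latent factors
V = rng.normal(size=(n_items, k))       # item latent factors
beta = 0.5                              # weight of the non-latent part
C_user = rng.normal(size=(n_users, 3))  # user attribute/content vectors
C_item = rng.normal(size=(n_items, 3))  # item attribute/content vectors
a = rng.normal(size=6)                  # weights of the stand-in F()

def F(c_user, c_item):
    """Stand-in choice of F(): linear regression on the attribute vectors."""
    return a @ np.concatenate([c_user, c_item])

def predict(i, j):
    # main effects + latent factor model + weighted arbitrary model
    return o[i] + p[j] + U[i] @ V[j] + beta * F(C_user[i], C_item[j])
```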

Some possible choices of function \(F()\) are:










Cascade Ranking

[back to contents]

TODO










see: https://arxiv.org/pdf/2210.09890.pdf

References

Aggarwal, Charu C. 2016. Recommender Systems - the Textbook. Springer.
Armstrong, Robert. 2008. The Long Tail: Why the Future of Business Is Selling Less of More. Canadian Journal of Communication. Vol. 33. https://doi.org/10.22230/cjc.2008v33n1a1946.
Cheng, Heng-Tze, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, et al. 2016. Wide & Deep Learning for Recommender Systems. arXiv. https://doi.org/10.48550/ARXIV.1606.07792.
Guo, Huifeng, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine Based Neural Network for CTR Prediction. arXiv. https://doi.org/10.48550/ARXIV.1703.04247.
Kaminskas, Marius, and Derek Bridge. 2016. Diversity, Serendipity, Novelty, and Coverage: A Survey and Empirical Analysis of Beyond-Accuracy Objectives in Recommender Systems. ACM Trans. Interact. Intell. Syst. Vol. 7. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2926720.
Rendle, Steffen. 2010. Factorization Machines. 2010 IEEE International Conference on Data Mining.
Wang, Ruoxi, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems. ACM. https://doi.org/10.1145/3442381.3450078.